We will start with a very simple example of learning from data: tossing a coin. We will first generate some data (30% heads and 70% tails) and then learn the CPD of the coin using a Maximum Likelihood Estimator and a Bayesian Estimator with a Dirichlet prior.


In [3]:
# Generate data
import numpy as np
import pandas as pd

raw_data = np.array([0] * 30 + [1] * 70) # Representing heads by 0 and tails by 1
data = pd.DataFrame(raw_data, columns=['coin'])
print(data)


    coin
0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
..   ...
70     1
71     1
72     1
73     1
74     1
75     1
76     1
77     1
78     1
79     1
80     1
81     1
82     1
83     1
84     1
85     1
86     1
87     1
88     1
89     1
90     1
91     1
92     1
93     1
94     1
95     1
96     1
97     1
98     1
99     1

[100 rows x 1 columns]

In [5]:
# Defining the Bayesian Model
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator

model = BayesianModel()
model.add_node('coin')

# Fitting the data to the model using Maximum Likelihood Estimator
model.fit(data, estimator=MaximumLikelihoodEstimator)
print(model.get_cpds('coin'))


+---------+-----+
| coin(0) | 0.3 |
+---------+-----+
| coin(1) | 0.7 |
+---------+-----+

In [6]:
# Fitting the data to the model using a Bayesian Estimator with a Dirichlet prior and equal pseudo counts.
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts={'coin': [[50], [50]]})
print(model.get_cpds('coin'))


WARNING:root:Replacing existing CPD for coin
+---------+-----+
| coin(0) | 0.4 |
+---------+-----+
| coin(1) | 0.6 |
+---------+-----+

We can see that we get the results as expected. In the Maximum Likelihood case the probabilities come purely from the data, whereas in the Bayesian case we had a uniform prior of $ P(H) = 0.5 $ and $ P(T) = 0.5 $, encoded as pseudo counts of 50 heads and 50 tails. Combining that prior with the 30 heads and 70 tails in the data gives a posterior of $ P(H) = (30 + 50)/200 = 0.4 $ and $ P(T) = (70 + 50)/200 = 0.6 $.
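To make the arithmetic explicit, here is a minimal sketch in plain NumPy (independent of pgmpy) that reproduces both estimates from the raw counts; the pseudo counts of 50 and 50 are the same ones passed to the estimator above.

import numpy as np

counts = np.array([30, 70])   # observed heads (0) and tails (1)
pseudo = np.array([50, 50])   # Dirichlet pseudo counts used above

mle = counts / counts.sum()                               # -> [0.3, 0.7]
posterior = (counts + pseudo) / (counts + pseudo).sum()   # -> [0.4, 0.6]
print(mle, posterior)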

Similarly, we can learn the parameters of more complex models. Let's take the student model as an example and compare the results of the Maximum Likelihood Estimator and the Bayesian Estimator.

The student model used below has the structure D -> G, I -> G, I -> S, G -> L: Difficulty (D) and Intelligence (I) influence Grade (G), Intelligence influences the SAT score (S), and Grade influences the recommendation Letter (L).


In [14]:
# Generating random data with each variable having 2 states and equal probabilities for each state
import numpy as np
import pandas as pd

raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
data = pd.DataFrame(raw_data, columns=['D', 'I', 'G', 'L', 'S'])
print(data)


     D  I  G  L  S
0    1  0  0  0  0
1    1  1  1  1  1
2    0  0  1  0  0
3    1  1  0  1  1
4    1  0  0  0  0
5    1  0  1  0  0
6    1  0  1  1  0
7    0  1  1  1  1
8    1  1  1  0  0
9    0  0  1  1  1
10   1  0  1  1  0
11   0  0  1  0  0
12   0  0  1  1  0
13   0  1  0  0  1
14   1  0  1  0  0
15   1  0  0  1  0
16   1  1  0  0  1
17   1  0  1  0  0
18   1  0  1  0  1
19   0  1  1  0  1
20   1  0  0  1  0
21   0  1  1  0  1
22   1  1  1  1  1
23   1  0  1  0  0
24   1  1  0  0  0
25   0  1  1  1  1
26   0  1  0  1  1
27   1  0  1  0  1
28   0  0  0  0  1
29   0  0  0  1  1
..  .. .. .. .. ..
970  0  0  1  1  0
971  0  1  0  1  0
972  1  1  1  1  0
973  0  0  0  0  0
974  1  0  1  0  0
975  1  0  0  1  0
976  1  0  1  0  1
977  0  0  1  0  0
978  0  1  1  0  1
979  0  0  0  0  1
980  0  1  0  0  1
981  0  1  0  0  1
982  1  1  1  1  1
983  1  1  0  1  0
984  0  0  1  0  1
985  1  0  1  0  0
986  0  1  0  1  1
987  0  0  0  1  0
988  1  0  1  0  1
989  1  0  1  0  1
990  1  1  1  1  0
991  0  1  0  0  1
992  1  0  0  1  0
993  0  1  0  1  0
994  0  1  1  1  0
995  0  1  0  1  0
996  0  0  0  0  0
997  0  0  1  1  1
998  0  1  1  1  1
999  1  0  1  1  0

[1000 rows x 5 columns]

In [15]:
# Defining the model
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator

model = BayesianModel([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])

# Learning CPDs using Maximum Likelihood Estimator
model.fit(data, estimator=MaximumLikelihoodEstimator)
for cpd in model.get_cpds():
    print("CPD of {variable}:".format(variable=cpd.variable))
    print(cpd)


CPD of D:
+------+-------+
| D(0) | 0.501 |
+------+-------+
| D(1) | 0.499 |
+------+-------+
CPD of G:
+------+------+----------------+----------------+----------------+
| D    | D(0) | D(0)           | D(1)           | D(1)           |
+------+------+----------------+----------------+----------------+
| I    | I(0) | I(1)           | I(0)           | I(1)           |
+------+------+----------------+----------------+----------------+
| G(0) | 0.48 | 0.509960159363 | 0.444915254237 | 0.551330798479 |
+------+------+----------------+----------------+----------------+
| G(1) | 0.52 | 0.490039840637 | 0.555084745763 | 0.448669201521 |
+------+------+----------------+----------------+----------------+
CPD of I:
+------+-------+
| I(0) | 0.486 |
+------+-------+
| I(1) | 0.514 |
+------+-------+
CPD of L:
+------+----------------+----------------+
| G    | G(0)           | G(1)           |
+------+----------------+----------------+
| L(0) | 0.489959839357 | 0.501992031873 |
+------+----------------+----------------+
| L(1) | 0.510040160643 | 0.498007968127 |
+------+----------------+----------------+
CPD of S:
+------+----------------+----------------+
| I    | I(0)           | I(1)           |
+------+----------------+----------------+
| S(0) | 0.512345679012 | 0.468871595331 |
+------+----------------+----------------+
| S(1) | 0.487654320988 | 0.531128404669 |
+------+----------------+----------------+

As the data was randomly generated with equal probabilities for each state, all the estimated probabilities are close to 0.5, as expected. Before moving on to the Bayesian Estimator, one of the conditional entries can be checked against the raw data.
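For example, here is a minimal pandas sketch reproducing P(G=0 | D=0, I=0) from the data frame; since the data is random, the exact value differs between runs but should stay close to 0.5.

# Empirical estimate of P(G=0 | D=0, I=0) from the generated data
subset = data[(data['D'] == 0) & (data['I'] == 0)]
print((subset['G'] == 0).mean())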


In [17]:
# Learning with Bayesian Estimator using dirichlet prior for each variable.

# Note that pseudo counts have to be specified for every state of the variable and for
# every combination of its parents' states (shape: no. of states x no. of parent configurations).
pseudo_counts = {'D': [[300], [700]],
                 'I': [[500], [500]],
                 'G': [[800, 200, 300, 500], [600, 400, 700, 500]],
                 'L': [[500, 300], [500, 700]],
                 'S': [[400, 700], [600, 300]]}
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts=pseudo_counts)
for cpd in model.get_cpds():
    print("CPD of {variable}:".format(variable=cpd.variable))
    print(cpd)


WARNING:root:Replacing existing CPD for I
WARNING:root:Replacing existing CPD for S
WARNING:root:Replacing existing CPD for D
WARNING:root:Replacing existing CPD for G
WARNING:root:Replacing existing CPD for L
CPD of D:
+------+--------+
| D(0) | 0.4005 |
+------+--------+
| D(1) | 0.5995 |
+------+--------+
CPD of G:
+------+-------+----------------+----------------+----------------+
| D    | D(0)  | D(0)           | D(1)           | D(1)           |
+------+-------+----------------+----------------+----------------+
| I    | I(0)  | I(1)           | I(0)           | I(1)           |
+------+-------+----------------+----------------+----------------+
| G(0) | 0.736 | 0.741806554756 | 0.732200647249 | 0.748218527316 |
+------+-------+----------------+----------------+----------------+
| G(1) | 0.264 | 0.258193445244 | 0.267799352751 | 0.251781472684 |
+------+-------+----------------+----------------+----------------+
CPD of I:
+------+-------+
| I(0) | 0.493 |
+------+-------+
| I(1) | 0.507 |
+------+-------+
CPD of L:
+------+----------------+----------------+
| G    | G(0)           | G(1)           |
+------+----------------+----------------+
| L(0) | 0.496662216288 | 0.500665778961 |
+------+----------------+----------------+
| L(1) | 0.503337783712 | 0.499334221039 |
+------+----------------+----------------+
CPD of S:
+------+----------------+----------------+
| I    | I(0)           | I(1)           |
+------+----------------+----------------+
| S(0) | 0.436742934051 | 0.423381770145 |
+------+----------------+----------------+
| S(1) | 0.563257065949 | 0.576618229855 |
+------+----------------+----------------+

Since the data was randomly generated with equal probabilities for each state, the likelihood pulls the posterior probabilities towards 0.5, but because of the prior the estimates end up between the prior probabilities and 0.5.
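As in the coin example, each posterior column is just the normalized sum of pseudo counts and observed counts. Here is a minimal sketch for D, using the counts implied by the MLE estimate above (the exact counts depend on the randomly generated data):

import numpy as np

observed = np.array([501, 499])   # counts of D=0 and D=1 implied by the MLE estimate of 0.501
pseudo = np.array([300, 700])     # Dirichlet pseudo counts specified for D
posterior = (observed + pseudo) / (observed + pseudo).sum()
print(posterior)                  # -> [0.4005, 0.5995]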